First of all, thanks for your interest in So To Speak. I hope you find it useful, friendly, and fun.
Second, if this program is redistributed then all documentation must be included and no part of it may be modified without the permission of the author (me).
Third, this program is shareware. I'm asking for (US)$20 for use of the program. If you don't want to, or can't, pay that amount, please at least send a postcard from wherever you are. My address is at the end of this document. The incentive for sending in the shareware fee is that I'll be encouraged to continue enhancing the program. I've received some great postcards from beta-testers, and I really appreciate them. They help keep me going at 3:00 am.
So To Speak demonstrates Apple's new Speech Manager. The Speech Manager is part of Apple's PlainTalk™ technologies, which include both speech recognition and text-to-speech synthesis. The Speech Manager is the text-to-speech part of PlainTalk. So To Speak provides access to most of the built-in capabilities of the Speech Manager. With So To Speak, you can have multiple voices synthesizing speech at once, vary the modulation, pitch, rate, and volume of the synthesized speech, and exercise full control over the pausing and stopping points. You can also change the voice used for synthesis, the way characters and numbers are interpreted, and whether the text entered should be treated as text or phonemes. The windows have been designed to work on any monitor size and depth; they are all resizable.
Requirements
So To Speak has a few requirements which must be satisfied for it to operate properly.
• a Macintosh running System 7.0 or later
• Speech Manager 1.0 or later (which of course has its own requirements)
• memory (more on this later...)
Unfortunately, I can't supply you with any of these items (well, I do have a ton of old 256 KB SIMMs!). The Mac and memory you can get at your local dealer. As for the Speech Manager software, I could license it from Apple and then distribute it freely with So To Speak, but I don't have $1500. If you have access to the Internet, you can ftp it from ftp.apple.com. Last time I checked, it was in the /dts/mac/sys.soft/speech directory. AOL and CompuServe users will have to check out their services for the software. Your dealer should also be able to get this software for you.
Talk To Me About The Speech Manager
The Speech Manager is a new System extension from Apple. There is a misconception amongst some users on Usenet that it is an application. It isn't; rather, it provides text-to-speech capabilities to applications written to take advantage of it. The purpose of the Speech Manager is to provide Macintosh applications easy access to text-to-speech ('tts') synthesis. TTS has many advantages, some outlined by Ian Witten in his book "Principles of Computer Speech" (Academic Press, 1982):
• leaves the hands and eyes free for other tasks
• easy access to remote data without special equipment (e.g. another human)
• enables disabled persons.
(For those interested in a detailed discussion of tts, Witten's book is an excellent starting point, albeit a bit dated.) The new software is a significant improvement over the original version of MacInTalk in that it doesn't require the user to input text as phonemes to achieve decent results. Programs like Talking Moose which employed the original MacInTalk will not work with the new version. The Speech Manager has a scalable architecture, meaning that it will take advantage of high-performance Macintoshes, while still being able to run on the lowest-powered machines. Other enhancements include asynchronous speech generation, a full set of speech controls, allowance of embedded commands, support for multiple, independent speech channels, and pronunciation dictionaries.
Apple implemented the scalable architecture by dividing the voice synthesizers into two categories, 'high-quality' and 'standard-quality'. The standard-quality software is called 'MacInTalk II' while the high-quality synthesizer is referred to as 'MacInTalk Pro.' (For the trivia buffs, the code name for MacInTalk Pro was 'Gala Tea'.) The requirements of each are given in the table below. For non-AV Macs, the full Speech Manager includes six files, including the Speech Manager extension itself, and various synthesizers. I use the term 'Speech Manager' to refer to the whole package. The package is pictured below. A short description of each file follows.
The Speech Manager Files and Information
File: Speech Manager
  Information: The basic System extension for tts. Includes the MacInTalk II synthesizer and a default voice ('Marvin').
  Requirements: System: > 6.0; RAM: 150 KB; CPU: 68000

File: MacInTalk Voices (real original icon!)
  Information: Provides nine additional voices for the MacInTalk II synthesizer.
  Requirements: Same as above. Other: Speech Manager extension

File: PlainTalk™ Text-To-Speech
  Information: System extension for the high-quality synthesizer. Includes a default voice ('Female Voice, Compressed').
  Requirements: System: > 7.0

File: TTS Male Voice
  Information: A high-quality, uncompressed male voice.
  Requirements: RAM: ≈ 2.7 MB; Other: PlainTalk™ Text-To-Speech extension

File: TTS Male Voice, Compressed
  Information: A highly-compressed version of the high-quality male voice.
  Requirements: RAM: ≈ 825 KB; Other: PlainTalk™ Text-To-Speech extension
Note the dependencies among the files. For example, if you want to use the high-quality male voice, you must install the TTS Male Voice, PlainTalk™ Text-To-Speech, and Speech Manager files. To install the Speech Manager files, I recommend placing all six files in the Extensions folder of the System Folder. If you drag-install the files, the extensions will be placed in the Extensions folder and the voice files will be installed in the root level of the System Folder. I place them all in the Extensions folder because it is neater and easier to remember, and it works fine. For those people still using System 6, place the Speech Manager and MacInTalk Voices files in the System Folder. You have no need for the other four files.
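For readers who think better in code, here is a minimal Python sketch of those dependencies. The DEPENDS_ON map and the files_needed function are names I made up for illustration; the entries come from the file descriptions, and the link from PlainTalk Text-To-Speech to the Speech Manager is inferred from the male-voice example rather than stated in the requirements.

```python
# Hypothetical dependency map, transcribed from the file descriptions.
# The PlainTalk -> Speech Manager link is an assumption based on the
# male-voice example in the text.
DEPENDS_ON = {
    "Speech Manager": [],
    "MacInTalk Voices": ["Speech Manager"],
    "PlainTalk Text-To-Speech": ["Speech Manager"],
    "TTS Male Voice": ["PlainTalk Text-To-Speech"],
    "TTS Male Voice, Compressed": ["PlainTalk Text-To-Speech"],
}


def files_needed(name):
    """Return the named file followed by everything it depends on."""
    needed = [name]
    for dep in DEPENDS_ON[name]:
        for item in files_needed(dep):
            if item not in needed:
                needed.append(item)
    return needed


print(files_needed("TTS Male Voice"))
```

Asking for 'TTS Male Voice' lists the same three files named in the example above.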
(More trivia: the MacInTalk II synthesizer utilizes approximately 60% of a Mac Classic CPU time while producing speech. MacInTalk Pro assumes about 25% of a IIfx's time.)
The Speech Manager works in the following way: Let's say you have a program, like So To Speak, which supports the Speech Manager. How does that application have text in one of its windows or menus spoken? Naturally, it is up to the developer to provide some way for the user to enter text or indicate what menu you want spoken or whatever. Assume that we type text in a text field. The application takes that text and tells the Speech Manager that it would like to have the text synthesized into speech. If the application is somewhat sophisticated, it provides a way for the user to change the voice used to synthesize, and perhaps the rate of the speech. If not, the Speech Manager will use a default voice and rate. When the time comes to actually speak the text, the application passes it to the Speech Manager, which acts as a dispatcher. The Speech Manager then dispatches the request to the proper synthesizer. The synthesizer is responsible for analyzing and converting the text and then producing the actual audio output. The entire process is diagrammed below.
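The dispatching idea can be sketched in a few lines of Python. This is only a toy model: the Synthesizer and Dispatcher classes, their methods, and the default rate of 150 are all hypothetical, and none of this is the real Speech Manager API. Only the voice and synthesizer names come from this document.

```python
class Synthesizer:
    """Stand-in for a synthesizer such as MacInTalk II or MacInTalk Pro."""

    def __init__(self, name, voices):
        self.name = name
        self.voices = set(voices)

    def speak(self, text, voice, rate):
        # A real synthesizer would analyze, convert, and play audio here.
        return f"[{self.name}/{voice} @ {rate} wpm] {text}"


class Dispatcher:
    """Toy model of the Speech Manager's role as a dispatcher."""

    def __init__(self, synthesizers, default_voice, default_rate):
        self.synthesizers = synthesizers
        self.default_voice = default_voice
        self.default_rate = default_rate

    def speak(self, text, voice=None, rate=None):
        # Fall back to the defaults when the application gives no choice.
        voice = voice or self.default_voice
        rate = rate or self.default_rate
        for synth in self.synthesizers:
            if voice in synth.voices:
                return synth.speak(text, voice, rate)
        raise LookupError(f"no synthesizer provides voice {voice!r}")


manager = Dispatcher(
    [
        Synthesizer("MacInTalk II", ["Marvin"]),
        Synthesizer("MacInTalk Pro", ["Female Voice, Compressed"]),
    ],
    default_voice="Marvin",
    default_rate=150,  # made-up default, for illustration only
)
print(manager.speak("Hello"))
print(manager.speak("Hello", voice="Female Voice, Compressed", rate=200))
```

The application never talks to a synthesizer directly; it hands the text to the dispatcher, which picks the synthesizer that owns the requested voice.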
Most of the differences between the standard- and high-quality synthesizers are found in the analysis and conversion stages. Both synthesizers implement a two-stage algorithm. The first stage is conversion of the text to its phonemic representation. A phoneme is a basic unit of human-produced sound. (Even more trivia: there are approximately 56 phonemes in American English. Linguists have recorded over 600 different phonemes in use in languages worldwide.) The second stage of the process is to map the phonemes to audible sounds, which are then reproduced through the speaker.
Each of these two stages can vary in complexity. For example, in the simplest case of conversion, the synthesizer could ignore all punctuation, grammar, abbreviations, etc., and simply convert the text into the first matching phoneme. At the other extreme is the synthesizer which considers not only punctuation and grammar, but stress, duration, and modulation. Obviously, the more complex the conversion, the larger the algorithm. In the production stage, at least two routes can be taken as well. Let's assume that all the phonemes have previously been recorded and are stored in the synthesizer file. The easiest approach would be to simply 'paste' the recorded sounds of the phonemes together and then play them. The resulting speech would sound blocky and unnatural to the listener. At a minimum, the synthesizer would need to 'bleed' two bordering phonemes into each other so that the transition between the two was less startling to the listener. Further enhancement would vary the pitch, modulation, and duration of each phoneme to match the intended tone of the typed text.
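To make the difference between 'pasting' and 'bleeding' concrete, here is a small Python sketch. It treats each recorded phoneme as a plain list of sample values and uses a linear crossfade. Both the function names and the crossfade itself are my assumptions for illustration; I don't know how the MacInTalk synthesizers actually blend phonemes.

```python
def paste(a, b):
    """Simplest production: butt the two recordings together."""
    return a + b


def bleed(a, b, overlap):
    """Crossfade the last `overlap` samples of a into the first of b."""
    if overlap <= 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must fit inside both recordings")
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight ramps from a toward b
        mixed.append((1 - w) * a[len(a) - overlap + i] + w * b[i])
    return a[:len(a) - overlap] + mixed + b[overlap:]


loud = [1.0, 1.0, 1.0, 1.0]   # made-up sample values
quiet = [0.0, 0.0, 0.0, 0.0]
print(paste(loud, quiet))     # abrupt jump from 1.0 to 0.0
print(bleed(loud, quiet, 3))  # gradual transition across the seam
```

The pasted version jumps straight from one sound to the next, while the blended version steps down gradually, which is the less startling transition described above.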
Some miscellaneous details. Currently, the Speech Manager only supports American Standard English. According to Apple, foreign language versions are in development, but no ship dates have been mentioned. It is an immense task to produce a solid synthesizer, and I would expect it to be at least a year (late 1994) before the first foreign (to North Americans :-) synthesizer is delivered. My understanding is that Japanese and German versions are going to be the next out the door.
The Speech Manager allows the developer to vary the pitch, rate, and modulation of the speech channel. There is a relationship between pitch and modulation, although it isn't enforced in the interface of version 1.0 of So To Speak. The relationship is defined as follows:
Hertz = 440.0 * 2^((BasePitch - 69) / 12)
BasePitch of 1.0 ≈ 9 Hertz
BasePitch of 39.5 ≈ 80 Hertz
BasePitch of 45.8 ≈ 115 Hertz
BasePitch of 50.4 ≈ 150 Hertz
BasePitch of 100.0 ≈ 2637 Hertz
and
Maximum pitch = BasePitch + PitchMod
Minimum pitch = BasePitch - PitchMod
Maximum Hertz = BaseHertz * 2^(+PitchMod / 12)
Minimum Hertz = BaseHertz * 2^(-PitchMod / 12)
Given:
BasePitch of 46.0 (≈ 117 Hertz),
PitchMod of 2.0,
Then:
Maximum pitch = 48.0 (≈ 131 Hertz),
Minimum pitch = 44.0 (≈ 104 Hertz)
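If you want to play with these numbers yourself, the relationship is easy to compute. The following Python sketch simply evaluates the pitch formula given above; the function names pitch_to_hertz and modulation_range are my own and are not part of the Speech Manager.

```python
def pitch_to_hertz(base_pitch):
    """Convert a pitch value to Hertz: 440 Hz at pitch 69, doubling every 12 steps."""
    return 440.0 * 2 ** ((base_pitch - 69) / 12)


def modulation_range(base_pitch, pitch_mod):
    """Return (minimum Hertz, maximum Hertz) for a base pitch and modulation."""
    return (
        pitch_to_hertz(base_pitch - pitch_mod),
        pitch_to_hertz(base_pitch + pitch_mod),
    )


for pitch in (1.0, 39.5, 45.8, 50.4, 100.0):
    print(pitch, round(pitch_to_hertz(pitch)))

low, high = modulation_range(46.0, 2.0)
print(round(low), round(high))
```

Because the scale is logarithmic, a PitchMod of 12 would span a full octave in each direction, from half the base frequency to double it.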
One last item, this time about 'rate.' The idea is that 'rate' corresponds to words-per-minute. While this measure is a useful one when thinking about 'rate,' it is not the way the synthesizer processes the text. The same rate may produce different results from different synthesizers. Try it out for yourself. One rule that does apply is that a doubling of the rate should halve the time needed to speak any particular text.
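As a rough illustration of the doubling rule, here is a Python sketch that takes the words-per-minute idea literally. That is a simplifying assumption (as noted, real synthesizers don't process text this way), and speaking_time_seconds is a hypothetical helper, not something the Speech Manager provides.

```python
def speaking_time_seconds(text, rate_wpm):
    """Estimate speaking time, treating rate as literal words per minute."""
    words = len(text.split())
    return 60.0 * words / rate_wpm


sample = "the quick brown fox jumps over the lazy dog"
slow = speaking_time_seconds(sample, 150)
fast = speaking_time_seconds(sample, 300)
print(slow, fast)
```

Under this simple model, doubling the rate from 150 to 300 cuts the estimate in half, which is the one rule you can count on across synthesizers.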
The Speech Manager does NOT require Sound Manager 3.0.
Now, about memory. There is another common misconception among users about memory allocation to applications using tts. When So To Speak, or any other tts application, requests text synthesis, the majority of the needed memory is allocated in the System's heap, not the application's. If you use the Male Voice, for example, So To Speak will need about 100K to open the necessary data structures. The synthesizer and data will be opened in the System's heap, so the System will require about 2.7 MB beyond what it already uses! If you are low on memory, increasing the memory partition of So To Speak will only exacerbate the problem since that will make that much less memory available to the System. The confusion is understandable, since this situation is contrary to what most Mac users have become accustomed to.
So To Speak Details
Now that you know a little more about the Speech Manager, let's look at how we can use it with So To Speak. There are two ways to start So To Speak. The first is to simply double-click on its icon in the Finder. After a few seconds, the application opens two speech windows, called 'So To Speak First' and 'So To Speak Second.' The other way to start the program is to drag the icon of a text file onto So To Speak's icon. If you start it this way, not only will the two speech windows be opened, but the text file will be opened in a third window. Your desktop should look like the one below.
The So To Speak Desktop
The text window in So To Speak behaves like a simple text editor. You can cut, copy, and paste text into the window, as you would expect, but you can only have one text file open at a time. You can also save changes you make to any text to a file.
Editing text is, well, okay, but the fun part of So To Speak is the speech windows. Here's a screen shot of one of them.
As you can see, So To Speak provides you with a number of ways to alter the synthesis of speech. At the top of the window is a pop-up menu presenting a list of the voices available on your machine. Which voices are available depends on which ones you have installed and the type of Macintosh you have. See the previous section for more information on installing voices. Selecting a different voice has a number of consequences. First, any speech that was being synthesized, even paused speech, is stopped immediately. Second, the positions of the four sliders will be updated to reflect the defaults for that voice. Third, all the modes will be reset to their 'normal' position. Fourth, any dictionaries which had been added will be released.
Once you've selected a voice, type or paste some text into the 'Text to speak:' box. To have So To Speak read the text to you, simply hit the 'Talk!' button. When you do, the text will be processed and then synthesized! While the text is being read, the 'Pause' and 'Stop' buttons will become active. Both behave according to the mode set in the 'Pause/Stop:' pop-up menu. The default mode is 'Immediately' which means just that: pressing the 'Pause' or 'Stop' button will cause the synthesis to do so as soon as possible. The other modes are 'At end of word' and 'At end of sentence.' Try them out!
Unlike paused speech, stopped speech cannot be continued. When you press the 'Stop' button, all processed text is discarded and synthesis is halted. Pressing the 'Pause' button does not discard processed text; it merely interrupts the synthesis. You can easily restart the paused synthesis where it left off by pressing the 'Continue' button. If you decide to stop the paused speech, simply press the 'Stop' button.
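The button behavior amounts to a small state machine, which can be sketched in Python. The SpeechWindow class below is a hypothetical model of what the buttons do, not So To Speak's actual code (which was written in Serius Workshop), and it ignores the 'At end of word' and 'At end of sentence' modes.

```python
class SpeechWindow:
    """Toy model of the Talk!, Pause, Continue, and Stop buttons."""

    def __init__(self):
        self.state = "idle"   # "idle", "speaking", or "paused"
        self.pending = []     # processed text not yet spoken

    def talk(self, text):
        self.pending = text.split()
        self.state = "speaking"

    def pause(self):
        if self.state == "speaking":
            self.state = "paused"    # pending text is kept

    def continue_(self):
        if self.state == "paused":
            self.state = "speaking"  # resumes where it left off

    def stop(self):
        if self.state in ("speaking", "paused"):
            self.pending = []        # processed text is discarded
            self.state = "idle"


window = SpeechWindow()
window.talk("Macs rule the world")
window.pause()
print(window.state, window.pending)
window.stop()
print(window.state, window.pending)
```

Pausing keeps the pending text, so continuing is possible; stopping throws it away, so pressing Continue afterward does nothing.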
The four sliders in the characteristics area of the window are used to vary the indicated parameters. The best way to learn about the effect of each parameter is to experiment. Play with the darn things! I tried to make this as easy as possible by allowing you to change their position while speech is being synthesized. To see how they work, paste, or type, a large block of text into the 'Text to speak:' window, then press the 'Talk!' button. When the computer begins speaking, drag one of the sliders (I like the 'Rate' slider the best) to the right. The speech should speed up. The lag between the time you make the change and its fulfillment depends somewhat on the speed of your Mac. Also, you must release the mouse button after dragging the slider for the program to sense that you've changed the value.
What about the character and number 'Modes' radio buttons? They allow you to turn on and off the attempt to interpret characters as words and numbers as ordinals, dollar amounts, years, etc. For example, in the 'normal' position the characters 'Macs Rule' would be pronounced as 'Macs Rule.' In the 'literal' mode, it would be pronounced 'capital-M-a-c-s-space-capital-R-u-l-e.' Similarly, with the 'Numbers' radio button in the 'normal' position '123' would be pronounced as 'one hundred twenty-three.' In the 'literal' position, the same text would be read 'one-two-three.'
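The 'literal' modes are simple enough to imitate in Python. The sketch below only mimics the spoken output described above; spell_characters and spell_digits are names I invented, and the real work is done inside the synthesizer, not by code like this.

```python
DIGIT_NAMES = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def spell_characters(text):
    """Mimic the 'literal' character mode: one item per character."""
    parts = []
    for ch in text:
        if ch == " ":
            parts.append("space")
        elif ch.isupper():
            parts.append("capital-" + ch)
        else:
            parts.append(ch)
    return "-".join(parts)


def spell_digits(text):
    """Mimic the 'literal' number mode: read each digit separately."""
    return "-".join(DIGIT_NAMES[ch] for ch in text if ch in DIGIT_NAMES)


print(spell_characters("Macs Rule"))
print(spell_digits("123"))
```

The two calls reproduce the 'Macs Rule' and '123' examples given above.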
The 'Text input' radio button has a slightly different effect. In the 'normal' position, the synthesizer will interpret the text in the 'Text to speak:' field as, well, text. When you click on the 'phonemes' button, two things will happen. First, the text in the 'Text to speak:' field will be converted to its phonemic representation. Second, the synthesizer will expect phonemes, not text, when it is commanded to speak the contents of the field.
The first point bears repeating and further explanation:
NOTE: Clicking on the 'phonemes' radio button will convert any text in the 'Text to speak:' field to its phonemic representation. Clicking back on the 'normal' button WILL NOT convert the phonemes to text; it simply restores the previous contents of the field. The Speech Manager provides no mechanism for converting phonemes to text.
Why bother then? Well, a future version of So To Speak will allow you to create and edit pronunciation dictionaries. These dictionaries consist of words and their preferred pronunciations. The words are listed normally, but the pronunciation must be given phonemically. Thus you need a way to create and tweak the phonemic representation of a word. So To Speak gives you that capability now with the 'phonemes' radio button.
To help you understand and work with phonemes and prosody, So To Speak provides two palettes listing the symbols the Speech Manager accepts. To open the palettes, select 'Phoneme Symbols' or 'Prosody Symbols' under the Windows menu.
There are two ways to use the palettes. First, they can help you understand how the Speech Manager works. Type some text into the 'Text to speak:' field and convert it to its phonemic representation. Then, examine the resulting phonemes and see if you can figure out how the Speech Manager has processed the text. The second way to use the palettes is to build your own words from phonemes. Double-clicking on any element in either palette will copy the symbol to the 'Text to speak:' field of the active window.
The last control in the window is the 'Add Dictionaries...' button. So To Speak is distributed with a sample dictionary, provided as a ResEdit file. To use the dictionary, click the 'Add Dictionaries...' button. The standard open file dialog will open. Find the 'Sample Dictionary' file and click the 'Open' button. Type 'Mississippi' in the text field of the active speech window. Press the 'Talk!' button and hear some humor Apple-style.
Extra Credit (50 points)
Convert text to phonemes. Compare the phonemic representation generated by the Speech Manager to that in a standard dictionary. Can you find any particular weak points or consistent mistakes? Compare the conversion provided by different synthesizers.
Other Information
This application is released 'as is.' The author is not responsible for any damage, destruction or loss of data caused by its use. So To Speak is shareware. The requested amount for its use is (US)$20; however, any amount would be appreciated. My addresses are:
Eric Weidl
142 N. 11th Ave.
St. Charles, IL 60174
Internet: e-weidl@uchicago.edu (preferred)
AppleLink: HATCHERY
If you have any problems, suggestions or comments, I would love to hear them. If you are reporting a bug, please give me as much detail as you possibly can. I'll even look at MacsBug dumps (sc6 and sc7 are particularly helpful!). I'll be able to respond more quickly if you contact me by email.
So To Speak was written in Serius Workshop. I've written a Speech object for Workshop using THINK C 5.0. If you have Serius Workshop and are interested in purchasing the object, send me some email.
If So To Speak doesn't accelerate your clock, check out SpokesDaemon (a faceless background application which processes other applications' requests for tts services) or Speech FKEY (an FKEY which reads any text on the clipboard).
Thanks to all the beta-testers!
C O P Y R I G H T   N O T I C E
Copyright © Eric J. Weidl, 1993. All Rights Reserved.
Bulletin Board Services/Shareware CD-ROMs
This package may be distributed on BBSes or CD-ROMs as is. It must include this document, and neither the application nor this document may be modified in any way. The software may not be sold or distributed for profit, or included with other software which is sold or distributed for profit, without the written permission of the author. I understand that CD-ROM distributors have to recover the cost of producing the CD-ROM, so charging for its distribution under such circumstances is permitted. I would love to see this software distributed and used as widely as possible!
Bugs
• So To Speak will only look for 'dict' resources with an id of 1.